Acoustic-visual synthesis technique using bimodal unit-selection
نویسندگان
چکیده
This paper presents a bimodal acoustic-visual synthesis technique that concurrently generates the acoustic speech signal and a 3D animation of the speaker’s outer face. This is done by concatenating bimodal diphone units that consist of both acoustic and visual information. In the visual domain, we mainly focus on the dynamics of the face rather than on rendering. The proposed technique overcomes the problems of asynchrony and incoherence inherent in classic approaches to audiovisual synthesis. The different synthesis steps are similar to typical concatenative speech synthesis but are generalized to the acoustic-visual domain. The bimodal synthesis was evaluated using perceptual and subjective evaluations. The overall outcome of the evaluation indicates that the proposed bimodal acoustic-visual synthesis technique provides intelligible speech in both acoustic and visual channels.
منابع مشابه
Introducing visual target cost within an acoustic-visual unit-selection speech synthesizer
In this paper, we present a method to take into account visual information during the selection process in an acoustic-visual synthesizer. The acoustic-visual speech synthesizer is based on the selection and concatenation of synchronous bimodal diphone units i.e., speech signal and 3D facial movements of the speaker’s face. The visual speech information is acquired using a stereovision techniqu...
متن کاملTowards a true acoustic-visual speech synthesis
This paper presents an initial bimodal acoustic-visual synthesis system able to generate concurrently the speech signal and a 3D animation of the speaker’s face. This is done by concatenating bimodal diphone units that consist of both acoustic and visual information. The latter is acquired using a stereovision technique. The proposed method addresses the problems of asynchrony and incoherence i...
متن کاملSetup for acoustic-visual speech synthesis by concatenating bimodal units
This paper presents preliminary work on building a system able to synthesize concurrently the speech signal and a 3D animation of the speaker’s face. This is done by concatenating bimodal diphone units, that is, units that comprise both acoustic and visual information. The latter is acquired using a stereovision technique. The proposed method addresses the problems of asynchrony and incoherence...
متن کاملWeight Optimization for Bimodal Unit-Selection Talking Head Synthesis
This paper addresses talking head synthesis based on the concatenation of units comprising of both acoustic and visual information. Selection of appropriate diphone units to synthesize a given text string is based on the minimization of a weighted linear combination of four costs that reflect linguistic, acoustic, and visual considerations. We present initial work toward a method to determine a...
متن کاملSymbolic vs. acoustics-based style control for expressive unit selection
The present paper addresses the issue of flexibility in expressive unit selection speech synthesis by using different style selection techniques. We select units from a mixed-style unit selection database, using either forced style switching, no control, symbolic target cost, or acoustic target cost as a style selection criterion. We assess the effect of selection technique, feature weight and ...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- EURASIP J. Audio, Speech and Music Processing
دوره 2013 شماره
صفحات -
تاریخ انتشار 2013